Pricing

from $2.00 / 1,000 websites analyzed

Website Markdown Crawler

Crawls a website and converts every page to clean Markdown optimized for LLM ingestion.

Rating

0.0

(0)

Developer

Ziad Tarik

Actor stats

Last modified: 6 hours ago

Features

Clean Markdown Extraction: Strips page noise such as navigation and footers, keeping only the main content.
Smart Chunking: Splits content into token-sized chunks that respect paragraph boundaries.
Language Filtering: Detects each page's language automatically and can keep only the languages you choose (e.g., only en or fr).
Domain Control: Keeps the crawler scoped to the seed URL's domain.
Regex Exclusions: Skips low-value URLs, such as tag or author pages, that match your regex patterns.
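
To make the Domain Control and Regex Exclusions features more concrete, here is a minimal sketch of how such a URL filter could work. It is not the Actor's actual implementation: the `in_scope` helper, the exact-host comparison (subdomains are treated as out of scope), and the example patterns are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse


def in_scope(url: str, seed_url: str, exclude_patterns: list[str]) -> bool:
    """Return True if `url` is on the seed's host and matches no exclusion regex.

    Assumption: "same domain" means an exact hostname match with the seed URL.
    """
    seed_host = urlparse(seed_url).hostname
    host = urlparse(url).hostname
    if host is None or host != seed_host:
        return False
    # Reject the URL if any exclusion pattern matches anywhere in it.
    return not any(re.search(pattern, url) for pattern in exclude_patterns)


# Hypothetical exclusion patterns for tag and author pages.
EXCLUDE = [r"/tag/", r"/author/"]
seed = "https://docs.example.com/"

candidates = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/tag/python",
    "https://docs.example.com/author/jane",
    "https://blog.example.com/post",
]
kept = [u for u in candidates if in_scope(u, seed, EXCLUDE)]
```

With these inputs only the getting-started page stays in `kept`; the tag and author pages are excluded by pattern and the blog URL is on a different host.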

Output Example

Each crawled page yields a structured JSON record:

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started — Example Docs",
  "description": "Learn how to set up Example in 5 minutes.",
  "language": "en",
  "wordCount": 842,
  "tokenEstimate": 1120,
  "headings": [
    { "level": 1, "text": "Getting Started" },
    { "level": 2, "text": "Installation" }
  ],
  "markdown": "# Getting Started\n\nLearn how to...",
  "chunks": [
    { "index": 0, "content": "# Getting Started\n\nLearn how to...", "tokenEstimate": 498 }
  ],
  "chunkCount": 1,
  "depth": 1,
  "crawledAt": "2026-06-10T14:32:00.000Z"
}
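
The `chunks` array in the record above comes from paragraph-aware splitting. How the Actor chunks internally is not documented here, so the following is only a sketch of the general technique. The `chunk_markdown` and `estimate_tokens` names, the 500-token default, and the rough "4 characters per token" estimate are assumptions.

```python
import re


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def chunk_markdown(markdown: str, max_tokens: int = 500) -> list[dict]:
    """Greedily pack whole paragraphs into chunks of at most `max_tokens`.

    Paragraphs are never split, so a single paragraph larger than the
    budget becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", markdown) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0

    for para in paragraphs:
        para_tokens = estimate_tokens(para)
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += para_tokens

    if current:
        chunks.append("\n\n".join(current))

    # Mirror the field names used in the output record.
    return [
        {"index": i, "content": content, "tokenEstimate": estimate_tokens(content)}
        for i, content in enumerate(chunks)
    ]
```

Because chunk boundaries always fall between paragraphs, joining the `content` fields with blank lines reproduces the page's paragraphs in order.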

Integrations

Connect the crawler directly into your RAG stack.
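
The snippets in this section assume a `dataset_items` list that holds the page records from a finished run. As a hedged sketch, this is one way to build that list from a JSON export of the dataset. The `parse_dataset_export` helper is hypothetical, and it assumes the export is a JSON array of records shaped like the output example.

```python
import json
from typing import Optional


def parse_dataset_export(raw_json: str, language: Optional[str] = None) -> list[dict]:
    """Parse a JSON dataset export and optionally keep a single language."""
    items = json.loads(raw_json)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of page records")
    if language is not None:
        # Use the record's own `language` field for filtering.
        items = [item for item in items if item.get("language") == language]
    return items


# Tiny inline sample standing in for the contents of a downloaded export file.
sample = """
[
  {"url": "https://docs.example.com/getting-started", "language": "en",
   "chunks": [{"index": 0, "content": "# Getting Started", "tokenEstimate": 4}]},
  {"url": "https://docs.example.com/fr/demarrage", "language": "fr",
   "chunks": [{"index": 0, "content": "# Demarrage", "tokenEstimate": 3}]}
]
"""
dataset_items = parse_dataset_export(sample, language="en")
```

In practice you would read the downloaded file's text and pass it in place of `sample`.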

LlamaIndex

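from llama_index.core import Document

# `dataset_items` is the list of page records from the Actor's dataset,
# e.g. loaded from the JSON export after a run.
# Build one Document per chunk, keeping the source URL and chunk index.
docs = [
    Document(text=chunk['content'], metadata={'url': item['url'], 'chunk': chunk['index']})
    for item in dataset_items
    for chunk in item['chunks']
]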

LangChain

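# Current LangChain releases expose Document from langchain_core.
from langchain_core.documents import Document as LCDoc

# Same `dataset_items` list as above: one LangChain document per chunk.
lc_docs = [
    LCDoc(page_content=chunk['content'], metadata={'source': item['url']})
    for item in dataset_items
    for chunk in item['chunks']
]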
